Mark Dunning and Aiora Zabala. Original material by Robert Stojnić, Laurent Gatto, Rob Foy John Davey, Dávid Molnár and Ian Roberts
Last modified: 07 Sep 2015
https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919
http://www.revolutionanalytics.com/companies-using-r
R)To launch RStudio, find the RStudio icon in the menu bar on the left of the screen and click
setwd("R_course/Day_1_scripts")
2 + 2
## [1] 4
20/5 - sqrt(25) + 3^2
## [1] 8
sin(pi/2)
## [1] 1
Note: The number in the square brackets is an indicator of the position in the output. In this case the output is a ‘vector’ of length 1 (i.e. a single number). More on vectors coming up…
<-x <- 10
x
## [1] 10
myNumber <- 25
myNumber
## [1] 25
sqrt(myNumber)
## [1] 5
x + myNumber
## [1] 35
x <- 21
x
## [1] 21
x <- myNumber
x
## [1] 25
myNumber <- myNumber + sqrt(16)
myNumber
## [1] 29
sin(x)
this returns the sine of x. In this case the function has one argument: x. Arguments are always contained in parentheses – curved brackets, () – separated by commas.
sum(3,4,5,6)
## [1] 18
max(3,4,5,6)
## [1] 6
min(3,4,5,6)
## [1] 3
seq(from = 2, to = 20, by = 4)
## [1] 2 6 10 14 18
seq(2, 20, 4)
## [1] 2 6 10 14 18
c combines its arguments into a vector:x <- c(3,4,5,6)
x
## [1] 3 4 5 6
[] indicate the position within the vector (the index). We can extract individual elements by using the [] notation:x[1]
## [1] 3
x[4]
## [1] 6
y <- c(2,3)
x[y]
## [1] 4 5
x <- c(3,4,5,6,7,8,9,10,11,12)
x <- 3:12
x
## [1] 3 4 5 6 7 8 9 10 11 12
seq() function, which returns a vector:x <- seq(2, 20, 4)
x
## [1] 2 6 10 14 18
x <- seq(2, 20, length.out=5)
x
## [1] 2.0 6.5 11.0 15.5 20.0
rep() function:y <- rep(3, 5)
y
## [1] 3 3 3 3 3
y <- rep(1:3, 5)
y
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
x <- 3:12
x[3:7]
## [1] 5 6 7 8 9
x[seq(2, 6, 2)]
## [1] 4 6 8
x[rep(3, 2)]
## [1] 5 5
y <- c(x, 1)
y
## [1] 3 4 5 6 7 8 9 10 11 12 1
z <- c(x, y)
z
## [1] 3 4 5 6 7 8 9 10 11 12 3 4 5 6 7 8 9 10
## [19] 11 12 1
x <- 3:12
x[-3]
## [1] 3 4 6 7 8 9 10 11 12
x[-(5:7)]
## [1] 3 4 5 6 10 11 12
x[-seq(2, 6, 2)]
## [1] 3 5 7 9 10 11 12
x[6] <- 4
x
## [1] 3 4 5 6 7 4 9 10 11 12
x[3:5] <- 1
x
## [1] 3 4 1 1 1 4 9 10 11 12
Remember! Square brackets for indexing [], parentheses for function arguments ().
x <- 1:10
y <- x*2
y
## [1] 2 4 6 8 10 12 14 16 18 20
z <- x^2
z
## [1] 1 4 9 16 25 36 49 64 81 100
y + z
## [1] 3 8 15 24 35 48 63 80 99 120
x + 1:2
## [1] 2 4 4 6 6 8 8 10 10 12
x + 1:3
## Warning in x + 1:3: longer object length is not a
## multiple of shorter object length
## [1] 2 4 6 5 7 9 8 10 12 11
gene.names <- c("Pax6", "Beta-actin", "FoxP2", "Hox9")
names function, which can be useful to keep track of the meaning of our data:gene.expression <- c(0, 3.2, 1.2, -2)
gene.expression
## [1] 0.0 3.2 1.2 -2.0
names(gene.expression) <- gene.names
gene.expression
## Pax6 Beta-actin FoxP2 Hox9
## 0.0 3.2 1.2 -2.0
names function to get a vector of the names of an object:names(gene.expression)
## [1] "Pax6" "Beta-actin" "FoxP2"
## [4] "Hox9"
| Species | Genome size (Mb) | Protein coding genes |
|---|---|---|
| Homo sapiens | 3,102 | 20,774 |
| Mus musculus | 2,731 | 23,139 |
| Drosophila melanogaster | 169 | 13,937 |
| Caenorhabditis elegans | 100 | 20,532 |
| Saccharomyces cerevisiae | 12 | 6,692 |
c function. Create a species.name vector and use this vector to name the values in the other two vectors.coding.bases.coding.bases and genome.size vectors to calculate this. (See earlier slides for how to do division in R.)genome.size <- c(3102, 2731, 169, 100, 12)
coding.genes <- c(20774, 23139, 13937, 20532, 6692)
species.name <- c("H. sapiens","M. musculus",
"D. melanogaster","C. elegans",
"S. cerevisiae")
names(genome.size) <- species.name
names(coding.genes) <- species.name
coding.bases <- coding.genes*0.0015
coding.bases
## H. sapiens M. musculus D. melanogaster
## 31.1610 34.7085 20.9055
## C. elegans S. cerevisiae
## 30.7980 10.0380
coding.pc <- coding.bases/genome.size*100
coding.pc
## H. sapiens M. musculus D. melanogaster
## 1.004545 1.270908 12.370118
## C. elegans S. cerevisiae
## 30.798000 83.650000
coding.bases[1]/coding.bases[5]
## H. sapiens
## 3.104304
genome.size[1]/genome.size[5]
## H. sapiens
## 258.5
coding.pc) but sometimes it is not (when we are comparing human to yeast). We can remove names by setting them to the special NULL value:names(coding.pc) <- NULL
coding.pc
## [1] 1.004545 1.270908 12.370118 30.798000
## [5] 83.650000
Typing lots of commands directly to R can be tedious. A better way is to write the commands to a file and then load it into R.
x <- 2 + 2
print(x)
Sourcing can also be performed manually with source("myScript.R")
? followed by the function name. For example:?seq
example function:example(seq)
?? followed by your guess. R will return a list of possibilities:??plot
; end of line (Enables multiple commands to be placed on one line of text)# comment (indicates text is a comment and not executed)+ command line wrap (R is waiting for you to complete an expression)sum() is in the base package and sd(), which calculates the standard deviation of a vector, is in the stats packageinstall.packagessource("http://bioconductor.org/biocLite.R")
biocLite() function:biocLite("PackageName")
install.packages() function to install it:install.packages("ggplot2")
DESeq is a Bioconductor package (http://www.bioconductor.org) for the analysis of RNA-seq datasource("http://www.bioconductor.org/biocLite.R")
biocLite("DESeq")
library(...) function to load the newly installed features:library(ggplot2) # loads ggplot functions
library(DESeq) # loads DESeq functions
library() # Lists all the packages you've got installed
| First_Name | Second_Name | Full_Name | Sex | Age | Weight | Consent | |
|---|---|---|---|---|---|---|---|
| 1 | Adam | Jones | Adam Jones | Male | 50 | 70.8 | TRUE |
| 2 | Eve | Parker | Eve Parker | Female | 21 | 67.9 | TRUE |
| 3 | John | Evans | John Evans | Male | 35 | 75.3 | FALSE |
| 4 | Mary | Davis | Mary Davis | Female | 45 | 61.9 | TRUE |
| 5 | Peter | Baker | Peter Baker | Male | 28 | 72.4 | FALSE |
| 6 | Paul | Daniels | Paul Daniels | Male | 31 | 69.9 | FALSE |
| 7 | Joanna | Edwards | Joanna Edwards | Female | 42 | 63.5 | FALSE |
| 8 | Matthew | Smith | Matthew Smith | Male | 33 | 71.5 | TRUE |
| 9 | David | Roberts | David Roberts | Male | 57 | 73.2 | FALSE |
| 10 | Sally | Wilson | Sally Wilson | Female | 62 | 64.8 | TRUE |
age <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9, 63.5,
71.5, 73.2, 64.8)
firstName <- c("Adam", "Eve", "John", "Mary", "Peter",
"Paul", "Joanna", "Matthew", "David", "Sally")
secondName <- c("Jones", "Parker", "Evans", "Davis",
"Baker","Daniels", "Edwards", "Smith", "Roberts", "Wilson")
TRUE and FALSE:consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE,
FALSE, TRUE, FALSE, TRUE)
c(20, "a string", TRUE)
## [1] "20" "a string" "TRUE"
class function: class(firstName)
## [1] "character"
class(age)
## [1] "numeric"
class(weight)
## [1] "numeric"
class(consent)
## [1] "logical"
sex <- c("Male", "Female", "Male", "Female", "Male", "Male",
"Female", "Male", "Male", "Female")
sex
## [1] "Male" "Female" "Male" "Female" "Male" "Male"
## [7] "Female" "Male" "Male" "Female"
factor(sex)
## [1] Male Female Male Female Male Male Female Male
## [9] Male Female
## Levels: Female Male
paste function joins character vectors together)patients <- data.frame(firstName, secondName,
paste(firstName, secondName),
sex, age, weight, consent)
patients
## firstName secondName paste.firstName..secondName. sex
## 1 Adam Jones Adam Jones Male
## 2 Eve Parker Eve Parker Female
## 3 John Evans John Evans Male
## 4 Mary Davis Mary Davis Female
## 5 Peter Baker Peter Baker Male
## 6 Paul Daniels Paul Daniels Male
## 7 Joanna Edwards Joanna Edwards Female
## 8 Matthew Smith Matthew Smith Male
## 9 David Roberts David Roberts Male
## 10 Sally Wilson Sally Wilson Female
## age weight consent
## 1 50 70.8 TRUE
## 2 21 67.9 TRUE
## 3 35 75.3 FALSE
## 4 45 61.9 TRUE
## 5 28 72.4 FALSE
## 6 31 69.9 FALSE
## 7 42 63.5 FALSE
## 8 33 71.5 TRUE
## 9 57 73.2 FALSE
## 10 62 64.8 TRUE
$’ operator:patients$age
## [1] 50 21 35 45 28 31 42 33 57 62
paste command)names function, and we can use the same function to see the names:names(patients) <- c("First_Name", "Second_Name",
"Full_Name", "Sex", "Age", "Weight", "Consent")
names(patients)
## [1] "First_Name" "Second_Name"
## [3] "Full_Name" "Sex"
## [5] "Age" "Weight"
## [7] "Consent"
patients <- data.frame(First_Name = firstName,
Second_Name = secondName,
Full_Name = paste(firstName,secondName),
Sex = sex,
Age = age,
Weight = weight,
Consent = consent)
names(patients)
## [1] "First_Name" "Second_Name"
## [3] "Full_Name" "Sex"
## [5] "Age" "Weight"
## [7] "Consent"
patients$First_Name
## [1] Adam Eve John Mary
## [5] Peter Paul Joanna Matthew
## [9] David Sally
## 10 Levels: Adam David Eve ... Sally
factor:patients <- data.frame(First_Name = firstName,
Second_Name = secondName,
Full_Name = paste(firstName, secondName),
Sex = factor(sex),
Age = age,
Weight = weight,
Consent = consent,
stringsAsFactors = FALSE)
patients$Sex
## [1] Male Female Male Female Male
## [6] Male Female Male Male Female
## Levels: Female Male
patients$First_Name
## [1] "Adam" "Eve" "John"
## [4] "Mary" "Peter" "Paul"
## [7] "Joanna" "Matthew" "David"
## [10] "Sally"
e <- matrix(1:10, nrow=5, ncol=2)
e
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
f <- matrix(1:10, nrow=2, ncol=5)
f
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
f %*% e
## [,1] [,2]
## [1,] 95 220
## [2,] 110 260
The %*% operator is the matrix multiplication operator, not the standard multiplication operator.
object[rows, colums]
e[1,2]
## [1] 6
e[1,]
## [1] 1 6
patients[1,2]
## [1] "Jones"
patients[1,]
## First_Name Second_Name Full_Name
## 1 Adam Jones Adam Jones
## Sex Age Weight Consent
## 1 Male 50 70.8 TRUE
s <- letters[1:5]
s[c(1,3)]
## [1] "a" "c"
s[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
## [1] "a" "c"
TRUE and FALSE values to subset the vectora <- 1:5
a < 3
## [1] TRUE TRUE FALSE FALSE FALSE
s[a < 3]
## [1] "a" "b"
<, >, <=, >=, ==, !=!, &, |, xor
TRUE, FALSE)s
## [1] "a" "b" "c" "d" "e"
a
## [1] 1 2 3 4 5
s[a > 1 & a <3]
## [1] "b"
s[a == 2]
## [1] "b"
my.patients using the instructions in the slides. Change the data if you like.sourcesource("1.2_patients.R")
country, continent, and height
country a character vector but continent a factorsummary function on your data frame. What does it do? How does it treat vectors (numeric, character, logical) and factors? (What does it do for matrices?)patients[patients$Age < 40,]
patients[patients$Consent == TRUE,]
patients[patients$Sex == "Male" & patients$Weight >= 70.8,]
read.table()read.csv(), read.delim()write.table()write.csv()rawdata)rawData <- read.delim("1.3_NBcountData.txt")
rawData[1:10,]
sep=",":read.csv("1.3_NBcountData.csv")
?read.table
rawData[1:10,] # View the first 10 rows to ensure import is OK
## Patient Nuclei NB_Amp NB_Nor NB_Del
## 1 1 65 0 63 2
## 2 2 51 3 43 5
## 3 3 37 2 35 0
## 4 4 37 2 35 0
## 5 5 45 2 42 1
## 6 6 46 4 41 1
## 7 7 65 1 64 0
## 8 8 59 1 54 4
## 9 9 49 0 48 1
## 10 10 46 0 45 1
View function to get a display of the data in RStudioView(rawData)
Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham)
NA values, which means the values are missing – a common occurrence in real data collectionNA is a special value that can be present in objects of any type (logical, character, numeric etc)NA is not the same as NULL. NULL is an empty R object. NA is one missing value within an R object (like a data frame or a vector)NAs gracefully, but sometimes we have to tell the functions what to do with them. R has some built-in functions for dealing with NAs, and functions often have their own arguments (like na.rm) for handling themx <- c(1,NA,3)
mean(x)
## [1] NA
mean(x, na.rm=TRUE)
## [1] 2
mean(na.omit(x))
## [1] 2
is.na(x)
## [1] FALSE TRUE FALSE
# create an index of results:
prop <- rawData$NB_Amp / rawData$Nuclei
# Get sample names of amplified patients:
amp <- which(prop > 0.33)
plot(prop, ylim=c(0,1))
abline(h=0.33) # Add a horizonal line
write.csv(rawData[amp,], file="selectedSamples.csv")
getwd() # print working directory
list.files() # list files in working directory
(NB_Amp / Nuclei < 0.33 & NB_Del == 0)norm <- which(prop < 0.33 & rawData$NB_Del == 0)
norm
## [1] 3 4 7 15 20 24 36 37 42 47
write.csv(rawData[norm,], "My_NB_output.csv")
lattice, ggplot2plot will make a scatter plot with the values of the vector on the y axis, and indices in the x axis
patients$Weight
## [1] 70.8 67.9 75.3 61.9 72.4 69.9 63.5
## [8] 71.5 73.2 64.8
plot(patients$Weight)
R tries to guess the most appropriate way to visualise the data
plot. It will put the values from the first argument in the x axis, and values from the second argument on the y axis
patients$Age
## [1] 50 21 35 45 28 31 42 33 57 62
plot(patients$Weight,patients$Age)
plot functionbarplotbarplot(patients$Age)
barplot(summary(patients$Sex))
hist(patients$Weight)
hist(patients$Age)
boxplot(patients$Weight)
y ~ x means put continuous variable y on the y axis and categorical x on the x axisboxplot(patients$Weight~patients$Sex)
ozone.csv
weather <- read.csv("ozone.csv")
View(weather)
| Ozone | Solar.R | Wind | Temp | Month | Day |
|---|---|---|---|---|---|
| 41 | 190 | 7.4 | 67 | 5 | 1 |
| 36 | 118 | 8.0 | 72 | 5 | 2 |
| 12 | 149 | 12.6 | 74 | 5 | 3 |
| 18 | 313 | 11.5 | 62 | 5 | 4 |
| NA | NA | 14.3 | 56 | 5 | 5 |
| 28 | NA | 14.9 | 66 | 5 | 6 |
plot(weather$Solar.R,weather$Ozone)
hist(weather$Temp)
boxplot(weather$Ozone)
boxplot(weather$Ozone~weather$Month)
plot comes with a whole array of arguments that can be set when we call the function
?plot and ?parplot(patients$Weight,type="l")
plot(patients$Weight,type="b")
main argumentplot(patients$Weight,patients$Age,main="Relationship between Weight and Age")
plot(patients$Weight,patients$Age,xlab="Weight")
plot(patients$Weight,patients$Age,ylab="Age")
ylim and xlim are used to specify axis limitsplot(patients$Weight,patients$Age,
ylab="Age",
xlab="Weight",
main="Relationship between Weight and Age",
ylim=c(10,70),
xlim=c(60,80))
"red", "orange","green","blue","yellow"….colours().Changing the col argument to plot changes the colour that the points are plotted in
plot(patients$Weight,patients$Age,col="red")
plot(patients$Weight,patients$Age, pch=16)
plot(patients$Weight,patients$Age, pch="X")
Character expansion
plot(patients$Weight,patients$Age, cex=2)
Character expansion
plot(patients$Weight,patients$Age, cex=0.2)
plot(patients$Weight,patients$Age,
pch = 1:10,
col=c("red","orange","green","blue"),
cex = 1:5)
plot
boxplot(patients$Weight~patients$Sex,
xlab="Sex",
ylab="Weight",
main="Relationship between Weight and Gender",
col=c("blue","yellow"))
breaks and freq arguments to hist (?hist)?rainbow)plot(weather$Solar.R,weather$Ozone,col="orange",pch=16,
ylab="Ozone level",xlab="Solar Radiation",
main="Relationship between ozone level and solar radiation")
hist(weather$Temp,col="purple",xlab="Temperature",
main="Distribution of Temperature",breaks = 50:100,freq=FALSE)
rainbow function is used to create a vector of colours for the boxplot
boxplot(weather$Ozone~weather$Month,col=rainbow(5))
RColorBrewer packagelibrary(RColorBrewer)
display.brewer.all()
boxplot(weather$Ozone~weather$Month,col=brewer.pal(5,"Set1"))